Abstract

In this paper, we present algorithms and lower bounds for the Longest Increasing Subsequence
(LIS) and Longest Common Subsequence (LCS) problems in the data streaming model.

For the problem of deciding whether the LIS of a given stream of integers drawn from
$\{1, \ldots, m\}$ has length at least $k$, we discuss a one-pass streaming algorithm using $O(k \log m)$
space, with update time either $O(\log k)$ or $O(\log\log m)$. For the problem of returning the actual
longest increasing subsequence itself, we give a $\lceil \log(1 + 1/\varepsilon) \rceil$-pass streaming algorithm with
update time $O(\log k)$ or $O(\log\log m)$ that uses space $O(k^{1+\varepsilon} \log m)$, for any $\varepsilon > 0$. We also
prove a lower bound of $\Omega(k)$ on the space required for any streaming algorithm for LIS, even
when the input stream is a permutation of $\{1, \ldots, m\}$.

We discuss a simple LIS-based algorithm for LCS, and we also give several lower bounds
on this problem, of which the strongest is the following: when the elements of two $n$-element
streams are presented in an adversarial order, we need space $\Omega(n/\rho^2)$ to approximate the length
of their LCS to within a factor of $\rho$, even when the two streams are permutations of each other.

1 Introduction 

Longest increasing and common subsequences. Let $S = x_1, x_2, \ldots, x_n$ be a sequence of
$n$ integers. A subsequence of $S$ is a sequence $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ with $i_1 < i_2 < \cdots < i_k$. Such a
subsequence is said to be increasing if $x_{i_1} < x_{i_2} < \cdots < x_{i_k}$. In this paper, we consider two
fundamental problems related to subsequences:

• Longest Increasing Subsequence (LIS). Given a sequence S, find a maximum-length 
increasing subsequence of S (or find the length of such a subsequence) . 

• Longest Common Subsequence (LCS). Given two sequences S and T, find a maximum- 
length sequence x which is a subsequence of both S and T (or find the length of x). 

Both LIS and LCS are fundamental combinatorial questions which have been well-studied in the 
computer science community [4, 6, 11, 16, 17, 22, among many others]. 
Among a large number of important applications of both of these problems, we highlight a few
that arise in computational biology. The BLAST (Basic Local Alignment Search Tool) [3] database 
supports queries of the following form: for a sequence $\sigma$ of amino acids, for example, what segments
of known proteins have high local similarity to $\sigma$? Zhang [25] has proposed filtering the results
of a BLAST query with an approach that uses an LIS algorithm as a black box to assemble the 
BLAST information about local similarity into a coherent picture of global similarity. An LIS step 
is also part of the MUMmer system for aligning entire genomes [8], and a straightforward LCS 
computation gives the value of the optimal alignment of two sequences of DNA [21]. 

The data streaming model. In the past few years, as we have witnessed the proliferation of 
truly massive data sets as diverse as fully sequenced genomes and the World Wide Web, traditional 
notions of efficiency have begun to appear inadequate. A polynomial-time algorithm — what is 
normally seen as the theoretical holy grail for a problem — may simply not be fast enough when run 
on an input like the multi-billion base pairs of the human genome. 

The theoretical computer science community has thus begun to explore new models of compu- 
tation, with new notions of efficiency, that more realistically capture when an algorithm is "fast 
enough." The data streaming model [15] is one such well-studied model. In this model, an algo- 
rithm must make a small number of passes over the input data, processing each input element as 
it passes. Once the algorithm has seen an element, it is gone forever; thus we must compute and 
store a small amount of useful information about the previously read input. We are interested in 
algorithms that use a sublinear amount of additional space. (With a linear amount of space, a 
streaming algorithm can simply store the entire input and then run a traditional algorithm.) We 
typically aim for a polylogarithmic amount of space and a polylogarithmic amount of processing 
time for each element of the input. Ideal data streaming algorithms make only a single pass over 
the data, but we are also interested in multipass streaming algorithms, in which the algorithm can 
make a small number (typically constant) of passes over the input data. 

Our results: LIS and LCS in the data streaming model. In this paper, we study the 
difficulty of finding longest increasing subsequences and longest common subsequences in the data 
streaming model. We are motivated in our exploration by the fact that LIS and LCS are both 
fundamental combinatorial questions; we believe that a solid characterization of the tractability of 
basic questions like LIS and LCS will lead to a greater understanding of the power and limitations 
of the data streaming model. 

One notable obstacle that we face in the LIS problem is that, unlike many problems that have 
been previously considered in the streaming model, the LIS of a stream is an essentially global 
order-based property. Many of the problems that have been considered in the streaming model — 
for example, finding the most frequently occurring items in a stream [7, 9], clustering streaming 
data [14], or finding order statistics for a given stream [2, 19] — are entirely independent of the order 
of the elements presented in S; permuting the order of the items in the stream does not affect the 
correct answers to these questions. The problem of counting inversions in a stream [1] (i.e., the
number of pairs of indices $(i, j)$ such that $i < j$ but $x_i > x_j$) is an inherently order-based problem,
but much more local than that of LIS in the sense that an inversion is a relation between exactly
two items in the stream, whereas an increasing subsequence of length $\ell$ is a relation among $\ell$ items.

In this sense, the LIS problem is more closely aligned to estimating the histogram of the 
stream [12, 13]. However, the solution to the LIS may be incredibly sensitive to small changes 
in the data. For instance, consider an LIS that consists primarily of the same repeated value. If 
we change the data stream so that many occurrences of this value are slightly smaller, it radically
changes the LIS. Similar notions apply to LCS as well. While this does not preclude efficient
streaming algorithms for LIS or LCS, it does suggest some of the difficulties. 

In this paper, we first present positive results on (1) computing the length of the LIS of a 
given input stream, and (2) outputting a maximum-length increasing sequence. We give a one-pass 
streaming algorithm that uses $O(k \log m)$ space to compute the length of the longest increasing
subsequence for a given input stream, where $m \ge \max_j x_j$ is an upper bound on the largest element
in the stream, and $k$ is the length of the LIS. (This algorithm was also discovered independently
by Fredman [11] and again by Bespamyatnikh and Segal [6], though not in the context of the
data streaming model.) Our algorithm maintains values $A[1 \ldots k']$, where $A[i] \in \{1, \ldots, m\}$ is the
smallest possible last element of all increasing subsequences of length $i$ in the part of the stream
that has already been read, and $k'$ is the length of the LIS for the stream so far. As we read each
element, we can update the array $A$ in time $O(\log k)$. This algorithm can also be implemented using
van Emde Boas queues or y-fast trees to achieve an update time of $O(\log\log m)$ [23, 24]. For the
problem of returning the length-$k$ LIS of a given stream, we give a one-pass streaming algorithm
that uses $O(k^2 \log m)$ space. In the context of multipass streaming algorithms, we reduce the space
requirement to $O(k^{1+\varepsilon} \log m)$ by using $\lceil \log(1 + 1/\varepsilon) \rceil$ passes over the data. This is nearly optimal,
since simply storing the LIS itself requires $\Omega(k)$ space.

We also present lower bounds on the LIS problem in the streaming model. In the comparison 
model, Fredman [11] has proven that $n \log n - n \log\log n + O(n)$ comparisons are necessary and
sufficient to compute the LIS of an n-integer sequence, via a reduction from sorting. To the best 
of our knowledge, however, no lower bounds on LIS in the streaming model have been shown 
previously. As with many lower bounds on problems in the streaming model, our results are 
based upon the well-observed connection between the space required by a streaming algorithm and 
communication complexity. Specifically, a space-efficient streaming algorithm A to solve a problem 
gives rise to a solution to the corresponding two-party problem with low communication complexity; 
one party runs A on the first part of the input, transmits the small state of the algorithm to the 
other party, who then continues to run A on the remainder of the input. We prove a lower bound
of $\Omega(k)$ for computing the LIS of a stream whenever $n = \Omega(k^2)$, by giving a reduction from the
Set-Disjointness problem, which is known to have high communication complexity.
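
The simulation argument can be pictured with a small Python sketch. Everything named here is an assumption of the example rather than part of the paper: the class `LisLengthSketch`, the function `two_party_run`, and the choice of JSON to serialize the state are purely illustrative. The only point is that the message passed from the first party to the second is exactly the stored state of the streaming algorithm, so a small-space algorithm gives a low-communication protocol.

```python
import json
from bisect import bisect_left


class LisLengthSketch:
    """Toy one-pass algorithm: keeps the smallest last element per length."""

    def __init__(self, tails=None):
        self.tails = list(tails or [])

    def update(self, x):
        # Replace the first stored value >= x, or extend the array.
        pos = bisect_left(self.tails, x)
        if pos == len(self.tails):
            self.tails.append(x)
        else:
            self.tails[pos] = x

    def state(self):
        # The entire memory of the algorithm, as bytes (hypothetical encoding).
        return json.dumps(self.tails).encode()

    @classmethod
    def from_state(cls, message):
        return cls(json.loads(message.decode()))


def two_party_run(first_part, second_part):
    """Party A processes first_part, sends its state, Party B finishes."""
    party_a = LisLengthSketch()
    for x in first_part:
        party_a.update(x)
    message = party_a.state()            # the only communication
    party_b = LisLengthSketch.from_state(message)
    for x in second_part:
        party_b.update(x)
    return len(party_b.tails), len(message)
```

A call such as `two_party_run([1, 3, 5], [2, 4, 6])` returns the LIS length of the concatenated stream together with the number of bytes that crossed between the parties.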

For the LCS problem, we discuss a simple LIS-based algorithm requiring $O(n \log m)$ space to
compute the LCS of two $n$-element sequences presented as streams. If we want to compute the LCS
of one $n$-element reference sequence against any number of test sequences, we can achieve the same
space bound, independent of the number of test sequences. Our main results on LCS, however, are
lower bounds. We prove that, if the two streams are general sequences, then we need $\Omega(n)$ space to
$\rho$-approximate the LCS of two streams of length $\Omega(n)$, for any factor $\rho$. If the given streams
are $n$-element permutations, we prove that we need $\Omega(n/\rho^2)$ space to $\rho$-approximate the LCS.

2 Algorithms for Longest Increasing Subsequence 

We begin by presenting positive results on the LIS problem, both for computing the length of an 
LIS, and for actually producing an LIS itself. We use a dynamic-programming-style algorithm,
maintaining the last element of the "best" increasing subsequence of length i seen so far, for each 
i less than or equal to the length of the LIS seen so far. 

The algorithm presented here to calculate the length of the LIS was also discovered indepen- 
dently by Fredman [11] and by Bespamyatnikh and Segal [6] in a context other than the data 
streaming model; we include the algorithm here because our multipass algorithm to produce the 
LIS is an extension of it. 



compute-LIS(X)
    A[0] := -1
    A[1] := ∞
    k' := 0
    WHILE there are elements left in the stream X
        Read in the next element x_i from X
        Find ℓ such that A[ℓ] < x_i ≤ A[ℓ+1]
        Set A[ℓ+1] := x_i
        IF ℓ + 1 > k'
            Set k' := k' + 1
            Set A[k' + 1] := ∞
    Output k'

Figure 1: Pseudocode to compute the length of the LIS in a given stream X. 
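
A minimal runnable sketch of the procedure in Figure 1, written in Python, may help make the update step concrete. The function name `compute_lis_length` and the 0-based list `tails` (where `tails[l - 1]` plays the role of `A[l]`) are choices of this example, and the sentinels `A[0]` and `A[k' + 1]` are handled implicitly by the bounds of the list. The binary-search variant is shown; the van Emde Boas and y-fast tree variants are not.

```python
from bisect import bisect_left


def compute_lis_length(stream):
    """One pass over `stream`; returns the length of its longest
    strictly increasing subsequence."""
    tails = []  # tails[l - 1] corresponds to A[l] in the pseudocode
    for x in stream:
        # pos counts the stored values that are < x, so it equals the
        # index l with A[l] < x <= A[l + 1].
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)      # l + 1 > k': the LIS grows by one
        else:
            tails[pos] = x       # overwrite A[l + 1] with the smaller x
    return len(tails)
```

Only the list `tails` is kept between elements, which matches the $k$ stored values used in the space bound.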



2.1 Computing the Length of an LIS 



Let $S = x_1, x_2, \ldots, x_i, \ldots$ be a stream of data, and consider a length-$\ell$ increasing subsequence
$\sigma = x_{i_1}, x_{i_2}, \ldots, x_{i_\ell}$ of $S$. Write $\mathrm{last}(\sigma) := x_{i_\ell}$. Let $\sigma_i$ denote the $i$th element in a subsequence
$\sigma$. For instance, $\mathrm{last}(\sigma) = \sigma_{|\sigma|}$. We say that $\sigma$ is $(\ell, j)$-minimal if $\mathrm{last}(\sigma)$ is minimized over all
length-$\ell$ increasing subsequences of the substream $x_1, x_2, \ldots, x_j$. We will say that such a $\sigma$ is an
$(\ell, j)$-minimal increasing sequence, or simply an $(\ell, j)$-MIS.

Our algorithm for computing the length of the LIS is based on maintaining $(\ell, j)$-minimal
subsequences for all $\ell \in \{1, \ldots, k'\}$ as we scan the stream, where $k'$ is the length of the longest
increasing subsequence in the stream so far. Specifically, the streaming algorithm works as follows: we
maintain an array $A[1 \ldots k']$, where, after we have scanned the first $j$ elements of the stream, $A[\ell]$ will
store $\mathrm{last}(\sigma)$ for an $(\ell, j)$-MIS $\sigma$. The algorithm updates each $A[\ell]$ as new elements from the stream
arrive, and increases $k'$ as appropriate. See Figure 1 for the pseudocode.

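Lemma 2.1 After $i$ iterations of the while loop in compute-LIS(), we have

\[
A[\ell] =
\begin{cases}
\mathrm{last}(\rho) \text{ for } \rho \text{ an } (\ell, i)\text{-MIS} & \text{if } \ell \le \mathrm{LIS}(x_1, \ldots, x_i) \\
\infty \text{ or uninitialized} & \text{otherwise.}
\end{cases}
\]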

Proof. We proceed by induction on $i$, after strengthening the stated property by adding the following
to the induction hypothesis:

(*) $A[j] < A[j']$ for all $j < j'$ such that $A[j]$, $A[j']$ are initialized.

For $i = 0$, the property is vacuously true. For the inductive case, assume the desired properties
were maintained after we read in the element $x_{i-1}$ from the stream. Now consider the moment at
which we read the next element $x_i$ from the stream. Let $\ell$ be such that $A[\ell] < x_i \le A[\ell + 1]$, as
in the algorithm. It is clear that only subsequences of length $\ell + 1$ or higher might have a new
smallest last element. That is, $x_i$ is only going to affect values in $A$ with indices $\ell + 1$ or higher.

On the other hand, note that $x_i$ can only extend a previous increasing subsequence $\sigma$ if $\sigma$ ends
with some element $\sigma_{|\sigma|} < x_i$. For all such subsequences, $\sigma_{|\sigma|} < x_i \le A[\ell + 1]$. Hence by the
induction hypothesis, $\sigma$ is of length $\ell$ or shorter. This implies that the sequence $\sigma' = \sigma, x_i$ is of
length at most $\ell + 1$. Thus $x_i$ can only affect values in $A$ with indices $\ell + 1$ or lower.



Indeed, we now have a new subsequence $\sigma'$ of length $\ell + 1$ with $x_i$ as the last element. (We can
extend the subsequence of length $\ell$ with last element $A[\ell]$.) So it is necessary and sufficient that
we update $A[\ell + 1]$. It is also clear that the new $A[j]$'s respect the ordering constraint. □

Theorem 2.2 We can decide whether the LIS of a given stream of integers from $\{1, \ldots, m\}$ has
length at least a given number $k$, or compute the length $k$ of the LIS of the given stream, with a
one-pass streaming algorithm that uses $O(k \log m)$ space and has update time $O(\log k)$ or $O(\log\log m)$.

Proof. By Lemma 2.1, the length is correctly computed by compute-LIS. Clearly, the decision problem can
also be solved with a minor change to the output of this algorithm.

For the space bound, observe that we keep $k$ values in the range $\{1, \ldots, m\}$, i.e., $O(\log m)$ bits
each. The only non-constant step in the update operation is to find the $\ell$ such that $A[\ell] < x_i \le
A[\ell + 1]$. This can be done in $O(\log k)$ time by binary search; alternatively, we can use a van Emde
Boas queue [23] or y-fast trees [24] to support updates in $O(\log\log m)$ time. □

2.2 Finding an LIS 

The algorithm described in the previous section only computes the length of the LIS, but does 
not find such a sequence. We now present a multipass streaming algorithm that actually finds a 
longest increasing subsequence. Specifically, our algorithm finds the length-$k$ LIS of a stream using
$O(k^{1+\varepsilon} \log m)$ space in $\lceil \log(1 + 1/\varepsilon) \rceil$ passes over the data. We first explain the one-pass version
of the algorithm, and then subsequently generalize it to multiple passes.

A one-pass algorithm. Consider an iteration in the decision algorithm in which we update,
say, $A[\ell + 1]$ to $x_i$. In other words, we have $A[\ell] < x_i \le A[\ell + 1]$. Then at this point, there
is an increasing subsequence $\sigma$ of length $\ell + 1$ whose last two elements are $A[\ell]$ and $x_i$, since $x_i$
appears later in the stream than $A[\ell]$. Unfortunately, at some future time the value $A[\ell]$ may also
be updated, and thus the old value is lost. (Thus, since the new $A[\ell]$ is later in the stream than $x_i$
was, we can no longer reconstruct the last two elements of $\sigma$.)

The straightforward fix for this difficulty is, for each $\ell$, to store the subsequence $\sigma^\ell$ of length
$\ell$ that ends with $A[\ell]$. Thus, the algorithm maintains $k$ sequences $\sigma^1, \ldots, \sigma^k$, taking a total of
$O(k^2 \log m)$ space. When we update $A[\ell + 1] := x_i$, we reset $\sigma^{\ell+1} := \sigma^\ell, x_i$. This adds only
a constant amount of extra running time per update, so the update time per element remains
$O(\log k)$ or $O(\log\log m)$, and the space requirement is $O(k^2 \log m)$.
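
The one-pass version can be sketched in Python as follows. This is an illustration under assumptions, not the paper's implementation: the name `find_lis_one_pass` is hypothetical, and each stored sequence is represented as a chain of `(value, parent)` cells so that "reset to the length-$\ell$ sequence followed by $x_i$" costs constant extra time. At most $k$ chains of length at most $k$ are reachable at any moment, which is the $k^2$ factor in the space bound.

```python
from bisect import bisect_left


def find_lis_one_pass(stream):
    """Return one longest strictly increasing subsequence of `stream`."""
    tails = []   # tails[l - 1] corresponds to A[l]
    chains = []  # chains[l - 1] is the last cell of the stored sequence
                 # of length l; a cell is (value, parent_cell_or_None)
    for x in stream:
        pos = bisect_left(tails, x)            # pos = l, A[l] < x <= A[l+1]
        parent = chains[pos - 1] if pos > 0 else None
        cell = (x, parent)                     # sequence l, extended by x
        if pos == len(tails):
            tails.append(x)
            chains.append(cell)
        else:
            tails[pos] = x
            chains[pos] = cell
    # Walk back from the end of the longest stored sequence.
    result = []
    cell = chains[-1] if chains else None
    while cell is not None:
        result.append(cell[0])
        cell = cell[1]
    result.reverse()
    return result
```

For example, `find_lis_one_pass([3, 1, 4, 1, 5, 9, 2, 6])` returns `[1, 4, 5, 6]`.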

A two-pass algorithm. We now describe a two-pass algorithm that requires less space. The key
modification is that during the first pass over the data, the algorithm only remembers part of each
$\sigma^\ell$, specifically every $q$th element (for a value of $q$ to be specified below). For each $\ell$, we maintain

\[
\tilde\sigma^\ell = \sigma^\ell_1,\ \sigma^\ell_{q+1},\ \sigma^\ell_{2q+1},\ \ldots,\ \sigma^\ell_{\lfloor (\ell-2)/q \rfloor q + 1},\ \sigma^\ell_\ell
\]

where $\sigma^\ell_\ell = A[\ell]$, as before. The update rule for the first pass of the algorithm is then

\[
\tilde\sigma^{\ell+1} :=
\begin{cases}
\tilde\sigma^\ell,\ x_i & \text{if } \ell \equiv 1 \pmod{q} \\
\text{all-but-last}(\tilde\sigma^\ell),\ x_i & \text{otherwise,}
\end{cases}
\]

where all-but-last$(\tilde\sigma^\ell)$ denotes the sequence $\tilde\sigma^\ell$ with the last element of the sequence, $A[\ell] = \sigma^\ell_\ell$,
omitted. Note that the space required for this entire pass is $O(k^2 \log m / q)$ when the length of the
LIS is $k$.



After the first pass is complete, we discard the subsequences $\tilde\sigma^1, \ldots, \tilde\sigma^{k-1}$, freeing a large amount
of space. Thus the only information we retain is the subsequence

\[
\tilde\sigma^k = \sigma^k_1,\ \sigma^k_{q+1},\ \sigma^k_{2q+1},\ \ldots,\ \sigma^k_{\lfloor (k-2)/q \rfloor q + 1},\ \sigma^k_k
\]

where $\sigma^k$ is a length-$k$ LIS of the input. Write $\tilde\sigma^k = z[1], z[2], \ldots, z[\lfloor (k-2)/q \rfloor], \sigma^k_k$.

In the second pass, we want to "fill in the blanks" of the subsequence $\tilde\sigma^k$ to produce $\sigma^k$.
Specifically, we want to find an increasing subsequence $\tau^\ell$ that starts with $z[\ell]$ and ends with $z[\ell+1]$
for each $\ell$. Notice that we can do this sequentially (for one $\ell$ at a time), since two consecutive $\tau$
subsequences do not overlap except at the endpoints. Thus each desired subsequence has length
exactly $q + 1$, and the total space required for the entire second pass is $O(q^2 \log m + k \log m)$.

Overall, the total space required by our algorithm is $O(\max(k^2 \log m / q,\ q^2 \log m) + k \log m)$.
This is minimized at $q = k^{2/3}$, giving us a space bound of $O(k^{1+1/3} \log m)$ for two passes.
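
As a quick check of this choice of $q$, the two terms inside the maximum can be balanced directly. The following is a worked derivation under the bounds stated above, ignoring constant factors and the lower-order $k \log m$ term:

```latex
\[
\frac{k^2 \log m}{q} = q^2 \log m
\;\Longrightarrow\; q^3 = k^2
\;\Longrightarrow\; q = k^{2/3},
\qquad
\left.\frac{k^2 \log m}{q}\right|_{q = k^{2/3}} = k^{4/3} \log m = k^{1 + 1/3} \log m .
\]
```

Any smaller $q$ inflates the first-pass term and any larger $q$ inflates the second-pass term, so the balance point is where the maximum is smallest.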

Generalizing to a $p$-pass algorithm. We can generalize this idea to a larger number of passes
by computing the $\tau^\ell$ subsequences recursively. As before, in the first pass the algorithm remembers
only every $q$th element in the subsequence, and discards all stored subsequences except $\tilde\sigma^k$. Then
the algorithm uses $p - 1$ passes to find the roughly $k/q$ subsequences $\tau^1, \tau^2, \ldots, \tau^{\lfloor (k-2)/q \rfloor - 1}$, where
each $\tau^\ell$ has length $q$.

Let $S(k, p)$ denote the space required by a $p$-pass algorithm to find a subsequence of length
$k$. We then have the following recurrence: $S(k, p) = \max(O(k^2 \log m / q),\ S(q, p - 1)) + O(k \log m)$.
Solving the recurrence, we find that the space requirements are optimized at $q = k^{1 - 1/(2^p - 1)}$,
where $S(k, p) = O(k^{1 + 1/(2^p - 1)} \log m)$.
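
The closed form can be sanity-checked with exact arithmetic on the exponents of $k$ alone. The sketch below is an illustration, with hypothetical helper names, and it makes two assumptions that are not spelled out in the text: logarithms are taken base 2, and the $\log m$ factor and constants are ignored. It verifies that at the stated $q$ the first-pass term and the recursive term have the same exponent, and it computes the number of passes needed for a target $\varepsilon$.

```python
from fractions import Fraction
from math import ceil, log2


def space_exponent(p):
    """Exponent of k in S(k, p), i.e. 1 + 1/(2^p - 1)."""
    return 1 + Fraction(1, 2**p - 1)


def sample_gap_exponent(p):
    """Exponent of k in the sampling gap q, i.e. 1 - 1/(2^p - 1)."""
    return 1 - Fraction(1, 2**p - 1)


def check_balance(p):
    """For p >= 2, check that k^2/q and S(q, p - 1) have equal exponents."""
    a = sample_gap_exponent(p)
    first_pass = 2 - a                       # exponent of k^2 / q
    recursive = a * space_exponent(p - 1)    # exponent of S(q, p - 1)
    return first_pass == recursive == space_exponent(p)


def passes_for(eps):
    """Number of passes ceil(log2(1 + 1/eps)) for space exponent 1 + eps."""
    return ceil(log2(1 + 1 / eps))
```

With $p = 1$ the exponent is 2, matching the one-pass bound, and with $p = 2$ it is $4/3$, matching the two-pass bound.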

Theorem 2.3 Fix any $\varepsilon > 0$. For a given $k$, we can find a length-$k$ increasing subsequence of
a given stream of integers from $\{1, \ldots, m\}$ with a $\lceil \log(1 + 1/\varepsilon) \rceil$-pass streaming algorithm that
uses $O(k^{1+\varepsilon} \log m)$ space and has update time $O(\log k)$ or $O(\log\log m)$. We can find the longest
increasing subsequence of a stream even when its length $k$ is not known in advance, using the same
number of passes, the same update time, and space $O(\frac{1}{\varepsilon} k^{1+\varepsilon} \log m)$.

Proof. Given $\varepsilon > 0$, we choose $p = \lceil \log(1 + 1/\varepsilon) \rceil$. Then the $p$-pass algorithm described above uses
space $O(k^{1+\varepsilon} \log m)$ to compute the LIS of the given stream.

If $k$ is unknown, then we modify the algorithm described above slightly. Define a recursive
sequence by $q_0 = 1$ and $q_{i+1} = q_i + q_i^{1-\varepsilon}$ for all $i \ge 0$. Then for the first pass only, change the
update rule to the following:

\[
\tilde\sigma^{\ell+1} :=
\begin{cases}
\tilde\sigma^\ell,\ x_i & \text{if } \ell = q_i \text{ for some } i \\
\text{all-but-last}(\tilde\sigma^\ell),\ x_i & \text{otherwise.}
\end{cases}
\]

So after the first pass, we have retained the sequence

\[
\tilde\sigma^k = \sigma^k_{q_0},\ \sigma^k_{q_1},\ \ldots,\ \sigma^k_{q_t},\ \sigma^k_k
\]

where $t$ is the largest index such that $q_t < k$.

By the recursion, the gap between adjacent indices $i, i + 1 \le t$ for elements of $\tilde\sigma^k$ we have
retained is $q_{i+1} - q_i = q_i^{1-\varepsilon} < k^{1-\varepsilon}$. In the standard algorithm where $k$ is known in advance, we
also have a gap of $k^{1-\varepsilon}$. So we can resume the standard algorithm from the second pass on, using
the same time and space requirements.



Now, for the first pass of the algorithm, the update time is identical to the standard algorithm.
However, the space used is $O(k t \log m)$. We now bound $t$. To this end, define $I_\phi =
\{ i : 2^\phi \le q_i < 2^{\phi+1} \}$. Let $i_\phi$ be the smallest index in $I_\phi$. Then for all $j \ge 0$, we see

\[
q_{i_\phi + j} \ge q_{i_\phi} + j\, q_{i_\phi}^{1-\varepsilon} \ge q_{i_\phi} + j\, 2^{\phi(1-\varepsilon)} .
\]

Hence, $|I_\phi| \le (2^{\phi+1} - 2^\phi) / 2^{\phi(1-\varepsilon)} \le 2^{\phi\varepsilon}$.

Let $I = \{ i : q_i < k \}$. By definition, $t \le |I|$. From the above, we have

\[
|I| \le \sum_{\phi=0}^{\log k - 1} |I_\phi| \le \sum_{\phi=0}^{\log k - 1} 2^{\phi\varepsilon} = \frac{k^\varepsilon - 1}{2^\varepsilon - 1} \le \frac{k^\varepsilon}{\varepsilon \ln 2}
\]

where the last inequality follows since $e^x \ge 1 + x$ for all $x$. So the total space used in the first pass
is $O(\frac{1}{\varepsilon} k^{1+\varepsilon} \log m)$, as we wanted. □
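
The growth of the sampling sequence $q_i$ can be checked numerically. The sketch below is illustrative only: the function name `sample_positions` is hypothetical, the values $q_i$ are kept as floating-point numbers rather than rounded to stream positions, and $k \ge 2$ is assumed. It generates $q_0, q_1, \ldots, q_t$ and lets one compare $t$ against the bound $k^\varepsilon / (\varepsilon \ln 2)$ and the gaps against $k^{1-\varepsilon}$.

```python
from math import log


def sample_positions(k, eps):
    """Return [q_0, ..., q_t] with q_0 = 1, q_{i+1} = q_i + q_i**(1 - eps),
    where t is the largest index with q_t < k (assumes k >= 2)."""
    qs = [1.0]
    while True:
        nxt = qs[-1] + qs[-1] ** (1 - eps)
        if nxt >= k:
            break
        qs.append(nxt)
    return qs


def index_bound(k, eps):
    """The bound k**eps / (eps * ln 2) on t from the proof."""
    return k ** eps / (eps * log(2))
```

For instance, with `k = 1000` and `eps = 0.5`, `len(sample_positions(1000, 0.5)) - 1` stays below `index_bound(1000, 0.5)`, and every gap between consecutive entries is below `1000 ** 0.5`.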



3 Lower Bounds for LIS 

We now turn our attention to a lower bound on the space required for streaming algorithms solving 
the longest increasing subsequence problem. In this section, we prove that $\Omega(k)$ bits of storage are
required to decide if the LIS of a stream of $N$ elements has length at least $k$, for any $N = \Omega(k^2)$.
Our proof is based on a reduction from the set disjointness problem, which is known to have high 
communication complexity: 

Definition 3.1 (Set Disjointness) In the Set-Disjointness problem, there are two parties A
and B who wish to solve the following problem. Party A holds an $n$-bit string $s_A$, and Party B
holds another $n$-bit string $s_B$. They must decide whether there is at least one '1' in the bitwise-and
$s_A \mathbin{\&} s_B$ of $s_A$ and $s_B$ (i.e., decide if $s_A$ and $s_B$ both have a '1' in at least one position) while
minimizing the number of bits communicated between the parties.

We will say that $s_A$ and $s_B$ intersect for a "yes" instance of Set-Disjointness.

Lower bounds for the set disjointness problem are of fundamental importance, and have been 
studied extensively (e.g., [5, 18, 20]). The most recent results show that even in the randomized 
setting, Set-Disjointness requires a large number of bits of communication: 

Proposition 3.2 ([5]) Let $\delta \in (0, 1/4)$. Any randomized protocol solving the Set-Disjointness
problem with probability at least $1 - \delta$ requires at least $\frac{n}{4}(1 - 2\sqrt{\delta})$ bits of communication, even when
$s_A$ and $s_B$ both contain exactly $n/4$ ones. □

We now reduce Set-Disjointness to the problem of determining if an increasing subsequence
of length $\sqrt{N}$ exists in a stream of $N$ elements. This reduction shows that, even if we allow
randomization and some chance of error, deciding whether there is an increasing subsequence of
length $k$ requires $\Omega(k)$ space in the streaming model.

Suppose we are given an instance $(s_A, s_B)$ of the Set-Disjointness problem, where $n := |s_A| =
|s_B|$. We will construct a stream lis-stream$(s_A, s_B)$ whose longest increasing subsequence has length
$n + 1$ if and only if $(s_A, s_B)$ are non-disjoint. Further, the first half of lis-stream$(s_A, s_B)$ depends
only on $s_A$, while its second half depends only on $s_B$. With each index $i \in \{1, \ldots, n\}$, we associate
the sequence $(n+1)(i-1)+1, \ldots, (n+1) i$, divided into two parts: the first $i$ integers form the
sequence $\text{A-part}(i) = (n+1)(i-1)+1,\ (n+1)(i-1)+2,\ \ldots,\ (n+1)(i-1)+i$ and the remaining
$n - i + 1$ integers form the sequence $\text{B-part}(i) = (n+1)(i-1)+i+1,\ (n+1)(i-1)+i+2,\ \ldots,\ (n+1) i$.



Let lis-stream-A$(s_A)$ be the sequence consisting of A-part$(i)$ for every $i \in \{i : s_A(i) = 1\}$,
in decreasing order of the index $i$. Similarly, let lis-stream-B$(s_B)$ be the sequence consisting of
B-part$(i)$ for every $i \in \{i : s_B(i) = 1\}$, also listed in decreasing order of the index $i$. Clearly,
lis-stream-A$(s_A)$ (and lis-stream-B$(s_B)$, respectively) only depends on $s_A$ ($s_B$, respectively). Then
we define the stream lis-stream$(s_A, s_B)$ to be lis-stream-A$(s_A)$ followed by lis-stream-B$(s_B)$.

As an example (which we will return to throughout the paper), consider the 9-bit vectors
$\mathrm{ex}_A = [0,1,0,1,1,0,0,0,0]$ and $\mathrm{ex}_B = [1,0,0,1,0,0,1,0,0]$. Then $n = 9$ and

lis-stream$(\mathrm{ex}_A, \mathrm{ex}_B)$ = 41, 42, 43, 44, 45, 31, 32, 33, 34, 11, 12,
68, 69, 70, 35, 36, 37, 38, 39, 40, 2, 3, 4, 5, 6, 7, 8, 9, 10.

Observe the increasing subsequence 31, 32, ..., 40 of length $n + 1 = 10$ in this stream.
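
The construction is mechanical enough to reproduce in a few lines of Python. The function names below (`a_part`, `b_part`, `lis_stream`, `lis_length`) are chosen for this sketch and are not from the paper; bit vectors are passed as 0-based Python lists while the index $i$ runs from 1 to $n$ as in the text.

```python
from bisect import bisect_left


def a_part(i, n):
    """The first i integers of the block for index i."""
    base = (n + 1) * (i - 1)
    return list(range(base + 1, base + i + 1))


def b_part(i, n):
    """The remaining n - i + 1 integers of the block for index i."""
    base = (n + 1) * (i - 1)
    return list(range(base + i + 1, (n + 1) * i + 1))


def lis_stream(s_a, s_b):
    """A-parts of the ones of s_a, then B-parts of the ones of s_b,
    each listed in decreasing order of the index i."""
    n = len(s_a)
    first = [x for i in range(n, 0, -1) if s_a[i - 1] for x in a_part(i, n)]
    second = [x for i in range(n, 0, -1) if s_b[i - 1] for x in b_part(i, n)]
    return first + second


def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)
```

Running `lis_stream` on the two 9-bit example vectors reproduces the stream listed above, and `lis_length` of that stream is 10.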

Lemma 3.3 The vectors $s_A$ and $s_B$ intersect if and only if LIS(lis-stream$(s_A, s_B)$) has length $n + 1$.

Proof. We prove the obvious direction first. If $s_A(i) = s_B(i) = 1$ for some particular $i$, then observe
that lis-stream$(s_A, s_B)$ contains the increasing subsequence A-part$(i)$ B-part$(i)$ $= (n+1)(i-1)+1,\
(n+1)(i-1)+2,\ \ldots,\ (n+1) i$, which contains $n + 1$ increasing integers.

For the converse direction, we prove its contrapositive form. Suppose $s_A$ and $s_B$ do not intersect.
Observe that whenever $i < j$ we have that (1) A-part$(i)$ follows A-part$(j)$ in lis-stream-A$(s_A)$, and
(2) the integers in A-part$(i)$ are all smaller than those in A-part$(j)$. Thus any increasing subsequence
within lis-stream-A$(s_A)$ (or lis-stream-B$(s_B)$, similarly) has length at most $n$, and can contain
integers from A-part$(i)$ for only a single $i$. Thus the only potential increasing subsequences
of length $n + 1$ must be subsequences of A-part$(i)$ B-part$(j)$ for some indices $i$ and $j$ so that
$s_A(i) = s_B(j) = 1$. (By assumption, then, we must have $i \ne j$.) Furthermore, unless $i < j$,
all the integers in A-part$(i)$ are larger than the integers in B-part$(j)$. Thus the longest increasing
subsequence in lis-stream$(s_A, s_B)$ is of length at most $|\text{A-part}(i)| + |\text{B-part}(j)| = i + n - j + 1 \le n$. □

We can improve the construction so that the resulting stream is a permutation,
i.e., a stream containing each of the numbers of $\{0, 1, \ldots, \ell\}$ exactly once. We will show
that a suitable $\ell = O(n^2)$ suffices. Our construction is an extension of the above. We modify
lis-stream-A and lis-stream-B as follows: we include the integers from A-part$(i)$ and B-part$(i)$ even
when $s_A(i) = 0$ or $s_B(i) = 0$, but so that only two of these elements can be part of an LIS:

• Let $U_A = \{x : x \in \text{A-part}(i) \text{ for some } i \text{ such that } s_A(i) = 0\}$. Then we define pad-A$(s_A)$ to be
the sequence consisting of integers in $U_A$ listed in decreasing order, followed by 0. We define
lis-stream-perm-A$(s_A)$ to be pad-A$(s_A)$ followed by lis-stream-A$(s_A)$.

• Similarly, let $U_B = \{x : x \in \text{B-part}(i) \text{ for some } i \text{ such that } s_B(i) = 0\}$. Then we define
pad-B$(s_B)$ to be the sequence consisting of $(n+1) \cdot n + 1$, followed by the integers in $U_B$
listed in decreasing order. We define lis-stream-perm-B$(s_B)$ to be lis-stream-B$(s_B)$ followed by
pad-B$(s_B)$.

Now define lis-stream-perm$(s_A, s_B)$ := lis-stream-perm-A$(s_A)$ lis-stream-perm-B$(s_B)$. This stream
consists of the "missing" elements of $s_A$ in decreasing order, followed by 0, then followed by the
"present" elements; then the "present" elements of $s_B$, followed by $(n+1) \cdot n + 1$, followed by the
"missing" elements of $s_B$. In our previous example, then,

lis-stream-perm$(\mathrm{ex}_A, \mathrm{ex}_B)$ = 89, ..., 81, 78, ..., 71, 67, ..., 61, 56, ..., 51, 23, ..., 21, 1, 0,
41, ..., 45, 31, ..., 34, 11, 12,
68, ..., 70, 35, ..., 40, 2, ..., 10,
91, 90, 80, 79, 60, ..., 57, 50, ..., 46, 30, ..., 24, 20, ..., 13.



One can easily verify that lis-stream-perm$(s_A, s_B)$ is a permutation of the set $\{0, \ldots, (n+1) \cdot n + 1\}$.

Lemma 3.4 The vectors $s_A$ and $s_B$ intersect if and only if LIS(lis-stream-perm$(s_A, s_B)$) has length
at least $n + 3$.

Proof. Observe that the prefix of lis-stream-perm$(s_A, s_B)$ ending with the element 0 is a decreasing
sequence, as is the suffix starting with the element $(n+1) \cdot n + 1$. Thus any increasing subsequence
of lis-stream-perm$(s_A, s_B)$ can contain at most one element from each of these segments. Thus the
following sequence must be a longest increasing subsequence of lis-stream-perm$(s_A, s_B)$: first 0, then
a longest increasing subsequence of lis-stream$(s_A, s_B)$, then $(n+1) \cdot n + 1$. By Lemma 3.3, then,
the length of the longest increasing subsequence of lis-stream-perm$(s_A, s_B)$ is $n + 3$ if and only if $s_A$
and $s_B$ intersect. □
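
The padded construction can be checked the same way. The sketch below is self-contained and uses hypothetical names (`lis_stream_perm` and the helpers it relies on); it builds the stream from two bit vectors so that the permutation property and the LIS length can be tested directly.

```python
from bisect import bisect_left


def a_part(i, n):
    base = (n + 1) * (i - 1)
    return list(range(base + 1, base + i + 1))


def b_part(i, n):
    base = (n + 1) * (i - 1)
    return list(range(base + i + 1, (n + 1) * i + 1))


def lis_stream_perm(s_a, s_b):
    """pad-A, lis-stream-A, lis-stream-B, pad-B, in that order."""
    n = len(s_a)
    idx = range(n, 0, -1)  # indices in decreasing order
    present_a = [x for i in idx if s_a[i - 1] for x in a_part(i, n)]
    present_b = [x for i in idx if s_b[i - 1] for x in b_part(i, n)]
    # "Missing" elements are listed in decreasing order of value.
    missing_a = sorted((x for i in idx if not s_a[i - 1]
                        for x in a_part(i, n)), reverse=True)
    missing_b = sorted((x for i in idx if not s_b[i - 1]
                        for x in b_part(i, n)), reverse=True)
    pad_a = missing_a + [0]
    pad_b = [(n + 1) * n + 1] + missing_b
    return pad_a + present_a + present_b + pad_b


def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)
```

On the running 9-bit example the result contains each of 0, ..., 91 exactly once and its LIS has length $n + 3 = 12$; on a small non-intersecting pair such as `[1, 0, 0]` and `[0, 1, 0]` the LIS has length $n + 2 = 5$.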

Theorem 3.5 For any length $k$ and for any $N \ge k \cdot (k - 1) + 2$, any streaming algorithm which
decides whether $\mathrm{LIS}(S) \ge k$ for a stream $S$ which is a permutation of $\{1, \ldots, N\}$ with probability at
least 3/4 requires $\Omega(k)$ space.

Proof. Suppose that an algorithm $\mathcal{A}(S)$ decides with probability at least 3/4 whether stream $S$,
where $|S| = N$, contains an increasing subsequence of length $k$. We show how to solve an instance
$(s_A, s_B)$ of the Set-Disjointness problem with $|s_A| = k - 1 = |s_B|$ with probability at least 3/4
by calling $\mathcal{A}$. The stream we consider is

\[
S := \underbrace{N - 1,\ N - 2,\ \ldots,\ k \cdot (k - 1) + 2}_{\text{Extra Numbers}},\ \text{lis-stream-perm}(s_A, s_B).
\]

Note that, as in the proof of Lemma 3.4, the longest increasing subsequence of $S$ has exactly the
same length as the longest increasing subsequence of lis-stream-perm$(s_A, s_B)$ since the prepended
elements of $S$ are all larger than those in lis-stream-perm$(s_A, s_B)$, and are presented in descending
order. Thus, by Lemma 3.4, the LIS of $S$ has length $k$ (and $\mathcal{A}(S)$ returns true with probability at
least 3/4) if and only if $s_A$ and $s_B$ do not intersect.

This immediately implies a lower bound on the space required by $\mathcal{A}$ by Proposition 3.2: to
solve the instance $(s_A, s_B)$ of the Set-Disjointness problem, Party A simulates the algorithm $\mathcal{A}$
on the stream Extra Numbers, lis-stream-perm-A$(s_A)$, then sends all stored information to Party B,
who continues simulating $\mathcal{A}$ on the remainder of the stream $S$. By Proposition 3.2, then, Party A
must transmit at least $\Omega(k)$ bits in this protocol, and thus $\mathcal{A}$ must use $\Omega(k)$ space. □



4 Longest Common Subsequence 

In this section, we turn to the LCS problem. Recall that for LCS we are given two streams $S_1$ and
$S_2$, consisting of $n_1$ and $n_2$ integers, respectively, drawn from the set $\{1, 2, \ldots, m\}$. Throughout
this section, we consider the adversarial streaming model, in which elements from the two streams
can be presented in any order of interleaving. Specifically, in the lower bounds that we construct
in this section, the algorithm is given access to all of $S_1$ before having access to any of $S_2$.

First, as with all streaming problems, observe that there is a trivial streaming algorithm that
solves LCS using $O(n_1 \log m + n_2 \log m)$ space: we simply store both streams in their entirety, and
then run a standard (non-streaming) LCS algorithm on the stored sequences. We can give another
algorithmic upper bound for a version of LCS, based upon a simple connection between LIS and
LCS. Suppose that we are first given one reference sequence $\mathcal{R}$ and then given a large number of test
sequences $S_1, S_2, \ldots, S_q$; we want to compute the LCS of $\mathcal{R}$ and $S_i$ for all $1 \le i \le q$. Our streaming
algorithm stores the permutation $\mathcal{R}$ as a lookup table, and then, for each $S_i$, runs the LIS algorithm
from Section 2, where we interpret two elements $x$ and $y$ to be ordered $x < y$ if $x$ appears before $y$
in $\mathcal{R}$. If these are $n$-element sequences, then this algorithm requires $O(n \log m)$ total space:
$O(n \log m)$ to store $\mathcal{R}$, and $O(k \log m) = O(n \log m)$ for the LIS computation. Note that this bound
is independent of $q$.
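
A short Python sketch of this LIS-based approach follows. It is an illustration under stated assumptions: the function name `lcs_lengths_against_reference` is hypothetical, the reference is assumed to contain no repeated elements (it is described as a permutation), and elements of a test sequence that do not occur in the reference are simply skipped since they cannot be part of a common subsequence.

```python
from bisect import bisect_left


def lcs_lengths_against_reference(reference, test_sequences):
    """For each test sequence, return the length of its LCS with
    `reference`, computed as an LIS under the order induced by
    positions in `reference`."""
    rank = {value: pos for pos, value in enumerate(reference)}
    if len(rank) != len(reference):
        raise ValueError("reference must not repeat elements")
    results = []
    for seq in test_sequences:
        tails = []  # reset per test sequence; only `rank` is shared
        for x in seq:
            r = rank.get(x)
            if r is None:
                continue  # x cannot appear in any common subsequence
            pos = bisect_left(tails, r)
            if pos == len(tails):
                tails.append(r)
            else:
                tails[pos] = r
        results.append(len(tails))
    return results
```

The lookup table `rank` is built once, and the working array is reused for every test sequence, which is why the space does not grow with the number of test sequences.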

In the remainder of this section, we present some lower bounds for LCS, again using the
Set-Disjointness problem. We first show an easy lower bound when $S_1$ and $S_2$ are not necessarily
permutations, and then show a more involved bound for exact or approximate computation of the
LCS for permutations.

4.1 Lower Bound on Exact and Approximate LCS for General Sequences 

It is straightforward to see that if we allow the streams $S_1$ and $S_2$ not to be permutations of each
other, then the lower bound is trivial, even for approximation:

Theorem 4.1 For any length $N$ and any approximation ratio $\rho$, any streaming algorithm which
$\rho$-approximates the LCS of two streams $S_1, S_2$ (in adversarial order) each of length $N$ with probability
at least 3/4 requires $\Omega(N)$ space, even when the algorithm is presented all of $S_1$ followed by all of
$S_2$.

Proof. Let $S$ be a sequence consisting of sequence $S_1$ followed by sequence $S_2$, and suppose that an
algorithm $\mathcal{A}(S)$ decides with probability at least 3/4 whether streams $S_1$ and $S_2$ contain a common
subsequence of length 1. We show how to solve an instance $(s_A, s_B)$ of Set-Disjointness with
$|s_A| = 4N = |s_B|$, where $s_A$ and $s_B$ both contain exactly $N$ ones, with probability at least 3/4 by
using $\mathcal{A}$.

Let stream $S_1$ consist of all $i$ such that $s_A(i) = 1$, and let $S_2$ consist of all $i$ such that $s_B(i) =
1$. Thus $S_1$ and $S_2$ have a common subsequence of length 1 if $s_A$ and $s_B$ have at least one
element in common, and of length 0 otherwise. Thus, if $\mathcal{A}(S)$ outputs the correct answer within any
approximation ratio, it must distinguish between the length 0 case and the length 1 case. This implies the
desired lower bound, since we can solve Set-Disjointness using $\mathcal{A}$. The first party simulates
$\mathcal{A}$ on stream $S_1$, then passes its state to the second party. The second party finishes simulating $\mathcal{A}$
on the rest of $S$, namely on $S_2$. By Proposition 3.2, this state must therefore use $\Omega(N)$ space.

To show that we still require $\Omega(N)$ space when one or both of the streams has length strictly
larger than $N$, we simply add arbitrary new elements to each of the above streams. □

Although the above construction is for multiplicative approximation, a simple variation also shows
that any data streaming algorithm solving this problem within an additive error of $a$ takes space at least
$\Omega(N/a)$; simply repeat each element in the streams $2a$ times.
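
The reduction in the proof of Theorem 4.1 can be sketched as follows. This is a minimal illustration, not a streaming algorithm: the function `lcs_is_nonempty` is a hypothetical offline stand-in for the algorithm $\mathcal{A}$, and the bit vectors are small made-up inputs with $N = 2$ ones among $4N = 8$ positions.

```python
def build_streams(s_a, s_b):
    """S_1 lists the indices i with s_a(i) = 1; S_2 lists those with s_b(i) = 1."""
    stream_1 = [i for i, bit in enumerate(s_a, start=1) if bit == 1]
    stream_2 = [i for i, bit in enumerate(s_b, start=1) if bit == 1]
    return stream_1, stream_2


def lcs_is_nonempty(stream_1, stream_2):
    """Offline stand-in for the algorithm: is the LCS length at least 1?"""
    return not set(stream_1).isdisjoint(stream_2)


def sets_intersect(s_a, s_b):
    """Answer Set-Disjointness by asking only whether the LCS has length 0 or 1."""
    return lcs_is_nonempty(*build_streams(s_a, s_b))


s_a = [1, 0, 1, 0, 0, 0, 0, 0]
s_b_hit = [0, 0, 1, 1, 0, 0, 0, 0]   # shares index 3 with s_a
s_b_miss = [0, 1, 0, 1, 0, 0, 0, 0]  # disjoint from s_a
answers = (sets_intersect(s_a, s_b_hit), sets_intersect(s_a, s_b_miss))
```

In the communication setting, the first party would process `stream_1` and hand over only the algorithm's state, which is why that state is subject to the Set-Disjointness bound.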

4.2 Lower Bound on Exact LCS for Permutations 

We now improve the construction to show a lower bound on the space required for an LCS algorithm
even when the two streams $S_1$ and $S_2$ are both permutations of the set $\{1, \ldots, n\}$.

Given an instance $(s_A, s_B)$ of the Set-Disjointness problem where there are exactly $n/4$ ones
in each of $s_A$ and $s_B$, we construct two streams as follows:

• lcs-perm-A($s_A$) consists of the sequence $R_A$ followed by the sequence $\overline{R}_A$, where $R_A$ contains
$\{i : s_A(i) = 1\}$ in increasing order of $i$ and $\overline{R}_A$ contains $\{i : s_A(i) = 0\}$ in decreasing order.
• lcs-perm-B($s_B$) consists of the sequence $R_B$ followed by the sequence $\overline{R}_B$, where $R_B$ contains
$\{i : s_B(i) = 1\}$ in increasing order and $\overline{R}_B$ contains $\{i : s_B(i) = 0\}$ in decreasing order.

Lemma 4.2 The vectors $s_A$ and $s_B$ intersect if and only if LCS(lcs-perm-A($s_A$), lcs-perm-B($s_B$))
has length at least $n/2 + 2$.

Proof. If $s_A$ and $s_B$ intersect, then we can construct a common subsequence of lcs-perm-A($s_A$) and
lcs-perm-B($s_B$) as follows. First choose the common element from $R_A$ and $R_B$. Since $s_A$ and $s_B$
intersect, the set $\{i : s_A(i) = s_B(i) = 0\}$ must contain at least $n/2 + 1$ elements, since there are
exactly $n/4$ ones in each of $s_A$ and $s_B$. This implies a common subsequence of $\overline{R}_A$ and $\overline{R}_B$ of length
$n/2 + 1$, and thus an overall common subsequence with total length $n/2 + 2$.

On the other hand, if $s_A$ and $s_B$ have no common element, then none of the elements in $R_A$ can
be matched up with $R_B$. Of course, some elements in $R_A$ might be matched with elements in $\overline{R}_B$
(or vice versa), but $R_A$ is in increasing order while $\overline{R}_B$ is in decreasing order, so at most one such
element can be matched. Also $\overline{R}_A$ and $\overline{R}_B$ have exactly $n/2$ common elements, so $\overline{R}_A$ can
be matched with at most $n/2$ elements in $\overline{R}_B$. Thus LCS(lcs-perm-A($s_A$), lcs-perm-B($s_B$)) can have
length at most $n/2 + 1$. □
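
Lemma 4.2 can be checked on a small instance with the sketch below. The vectors are made-up inputs with $n = 8$ and $n/4 = 2$ ones each, and the helper names `lcs_perm` and `lcs_length` are hypothetical; the LCS is computed offline by the textbook dynamic program rather than by any streaming method.

```python
def lcs_length(a, b):
    """Textbook dynamic-programming LCS length, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_perm(bits):
    """Indices of ones in increasing order, then indices of zeros in decreasing order."""
    ones = [i for i, bit in enumerate(bits, start=1) if bit == 1]
    zeros = [i for i, bit in enumerate(bits, start=1) if bit == 0]
    return ones + zeros[::-1]


n = 8
s_a = [1, 1, 0, 0, 0, 0, 0, 0]
s_b_hit = [1, 0, 1, 0, 0, 0, 0, 0]   # intersects s_a at index 1
s_b_miss = [0, 0, 1, 1, 0, 0, 0, 0]  # disjoint from s_a

hit_length = lcs_length(lcs_perm(s_a), lcs_perm(s_b_hit))
miss_length = lcs_length(lcs_perm(s_a), lcs_perm(s_b_miss))
```

On this instance the intersecting pair reaches the threshold $n/2 + 2 = 6$, while the disjoint pair stays at or below $n/2 + 1 = 5$.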

Theorem 4.3 For any length $k$ and for any $N \ge 2k - 4$, any streaming algorithm which decides
whether LCS($S_1, S_2$) $\ge k$ for streams $S_1, S_2$ which are permutations of $\{1, \ldots, N\}$ with probability
at least 3/4 requires $\Omega(k)$ space.

Proof. The theorem follows analogously to Theorem 4.1 when $N = 2k - 4$: deciding whether
lcs-perm-A($s_A$) and lcs-perm-B($s_B$) have a common subsequence of length $N/2 + 2 = k$ requires
$\Omega(k) = \Omega(N)$ space, by Lemma 4.2.

For larger $N$, we pad the streams, as in Theorem 3.5. Add the decreasing sequence
$N, N - 1, N - 2, \ldots, 2k - 4 + 1$ to the beginning of lcs-perm-A($s_A$) and add the increasing sequence
$2k - 4 + 1, 2k - 4 + 2, \ldots, N$ to the end of lcs-perm-B($s_B$). Then any common subsequences of these
extended sequences are either (1) contained entirely in the unextended portions of lcs-perm-A($s_A$)
and lcs-perm-B($s_B$), or (2) have length at most one. Then, as before, the LCS has length at least $k$ if and
only if $s_A$ and $s_B$ intersect, and thus we require $\Omega(k)$ space to compute the LCS. □
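
The padding step in the proof of Theorem 4.3 can be illustrated as follows. The helper `pad_streams` and the small core permutations are assumptions made for the example (here $k = 4$, so the core has $2k - 4 = 4$ elements, and $N = 7$); the LCS is again computed offline.

```python
def lcs_length(a, b):
    """Textbook dynamic-programming LCS length, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def pad_streams(core_a, core_b, total):
    """Extend two permutations of {1..len(core_a)} to permutations of {1..total}."""
    extra = list(range(len(core_a) + 1, total + 1))
    # Decreasing padding goes before stream A; increasing padding goes after stream B.
    return extra[::-1] + core_a, core_b + extra


core_a = [1, 4, 3, 2]
core_b = [2, 4, 3, 1]
padded_a, padded_b = pad_streams(core_a, core_b, total=7)
core_length = lcs_length(core_a, core_b)
padded_length = lcs_length(padded_a, padded_b)
```

Because a padding element comes first in one stream and last in the other, it cannot be combined with any other element, so the padded LCS equals the core LCS whenever the latter is at least 1.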

4.3 Lower Bound on Approximating LCS for Permutations 

We now present lower bounds for the space required for approximation algorithms for LCS on
permutations. Suppose that $\rho$ is the desired approximation ratio. For each $i$, we will construct
sequences $\rho$-approx-A($i, s_A$) and $\rho$-approx-B($i, s_B$) so that the two sequences have a common
subsequence of length $\rho^2$ if $s_A(i) = s_B(i) = 1$, and so that the longest common subsequence has length
at most $\rho$ otherwise. For each $i \le n$, both sequences are of length $\rho^2$, and consist of integers from
$\{(i-1)\rho^2 + 1, (i-1)\rho^2 + 2, \ldots, (i-1)\rho^2 + \rho^2\}$. We define them as follows:

• For $s_A(i) = 1$, define $\rho$-approx-A($i, s_A$) to be the increasing sequence
$(i-1)\rho^2 + 1, (i-1)\rho^2 + 2, \ldots, (i-1)\rho^2 + \rho^2$. If $s_A(i) = 0$, then define $\rho$-approx-A($i, s_A$) to be the decreasing
sequence $(i-1)\rho^2 + \rho^2, (i-1)\rho^2 + \rho^2 - 1, \ldots, (i-1)\rho^2 + 1$.

• For $s_B(i) = 1$, define $\rho$-approx-B($i, s_B$) to be the increasing sequence
$(i-1)\rho^2 + 1, (i-1)\rho^2 + 2, \ldots, (i-1)\rho^2 + \rho^2$. When $s_B(i) = 0$, we use a more complicated ordering of the $\rho^2$
numbers. Specifically, we use what we call the median sequence $a$ of these $\rho^2$ numbers so that
the longest increasing subsequence and the longest decreasing subsequence of $a$ both have
length exactly $\rho$. In this case, we define $\rho$-approx-B($i, s_B$) to be the sequence

$(i-1)\rho^2 + \rho,\ (i-1)\rho^2 + \rho - 1,\ \ldots,\ (i-1)\rho^2 + 1,$

$(i-1)\rho^2 + 2\rho,\ (i-1)\rho^2 + 2\rho - 1,\ \ldots,\ (i-1)\rho^2 + \rho + 1,$

$\vdots$

$(i-1)\rho^2 + \rho^2,\ (i-1)\rho^2 + \rho^2 - 1,\ \ldots,\ (i-1)\rho^2 + (\rho - 1)\rho + 1.$

Given an instance $(s_A, s_B)$ of the Set-Disjointness problem where there are exactly $n/4$ ones
in each of $s_A$ and $s_B$, we construct two streams as follows:

lcs-$\rho$-approx-perm-A($s_A$) = $\rho$-approx-A($1, s_A$), $\rho$-approx-A($2, s_A$), $\ldots$, $\rho$-approx-A($n, s_A$) and
lcs-$\rho$-approx-perm-B($s_B$) = $\rho$-approx-B($n, s_B$), $\rho$-approx-B($n - 1, s_B$), $\ldots$, $\rho$-approx-B($1, s_B$).

Returning to our example from Section 3 where n = 9, we have 

lcs-2-approx-perm-A($\mathrm{ex}_A$) = 4,3,2,1, 5,6,7,8, 12,11,10,9, 13,14,15,16, 17,18,19,20,

24,23,22,21, 28,27,26,25, 32,31,30,29, 36,35,34,33.

lcs-2-approx-perm-B($\mathrm{ex}_B$) = 34,33,36,35, 30,29,32,31, 25,26,27,28, 22,21,24,23,

18,17,20,19, 13,14,15,16, 10,9,12,11, 6,5,8,7, 1,2,3,4.
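
The construction of the two streams can be sketched as follows. The function names are hypothetical, and the bit vectors `ex_a` and `ex_b` are an assumption: they are read off from the blocks of the example streams listed above (an increasing block means the bit is 1), not copied from Section 3.

```python
def rho_approx_a(i, bit, rho):
    """Block i for stream A: increasing if the bit is 1, decreasing otherwise."""
    base = (i - 1) * rho * rho
    increasing = list(range(base + 1, base + rho * rho + 1))
    return increasing if bit == 1 else increasing[::-1]


def rho_approx_b(i, bit, rho):
    """Block i for stream B: increasing if the bit is 1, median sequence otherwise."""
    base = (i - 1) * rho * rho
    if bit == 1:
        return list(range(base + 1, base + rho * rho + 1))
    # Median sequence: rho consecutive groups of rho values, each group reversed.
    return [base + j * rho + r for j in range(rho) for r in range(rho, 0, -1)]


def stream_a(bits, rho):
    """Blocks for i = 1, 2, ..., n in that order."""
    return [x for i, bit in enumerate(bits, start=1) for x in rho_approx_a(i, bit, rho)]


def stream_b(bits, rho):
    """Blocks for i = n, n - 1, ..., 1 in that order."""
    n = len(bits)
    return [x for i in range(n, 0, -1) for x in rho_approx_b(i, bits[i - 1], rho)]


ex_a = [0, 1, 0, 1, 1, 0, 0, 0, 0]  # inferred from the listed stream A
ex_b = [1, 0, 0, 1, 0, 0, 1, 0, 0]  # inferred from the listed stream B
a_stream = stream_a(ex_a, rho=2)
b_stream = stream_b(ex_b, rho=2)
```

With these inferred vectors the sketch reproduces both listed streams, and the only index set in both vectors is $i = 4$, whose block is 13, 14, 15, 16.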

Lemma 4.4 If $s_A$ and $s_B$ intersect, then LCS(lcs-$\rho$-approx-perm-A($s_A$), lcs-$\rho$-approx-perm-B($s_B$))
has length at least $\rho^2$. If $s_A$ and $s_B$ do not intersect, then the length of the LCS is at most $\rho$.

Proof. If $s_A$ and $s_B$ intersect, say with $s_A(i) = s_B(i) = 1$, then we see that $\rho$-approx-A($i, s_A$) =
$\rho$-approx-B($i, s_B$). Hence the sequence $(i-1)\rho^2 + 1, \ldots, i\rho^2$ has length $\rho^2$ and is a subsequence of
both lcs-$\rho$-approx-perm-A($s_A$) and lcs-$\rho$-approx-perm-B($s_B$). (In our example, 13, 14, 15, 16 is such a
subsequence.)

On the other hand, suppose $s_A$ and $s_B$ do not intersect. Recall that lcs-$\rho$-approx-perm-A($s_A$) lists
the $\rho$-approx-A($\cdot, s_A$) in increasing order of index, while lcs-$\rho$-approx-perm-B($s_B$) lists the $\rho$-approx-B($\cdot, s_B$) in
decreasing order of index. Thus any common subsequence can only contain numbers from the block
corresponding to a single index $i$. Since $s_A$ and $s_B$ do not intersect, we know that for any index
$i$ one of the three following cases holds:

1. $s_A(i) = 1$, $s_B(i) = 0$. Then $\rho$-approx-A($i, s_A$) and $\rho$-approx-B($i, s_B$) have a longest common
subsequence of length $\rho$, since one is an increasing sequence while the other is a median
sequence.

2. $s_A(i) = 0$, $s_B(i) = 1$. Then $\rho$-approx-A($i, s_A$) and $\rho$-approx-B($i, s_B$) have a longest common
subsequence of length 1, since one is a decreasing sequence while the other is an increasing
sequence.

3. $s_A(i) = 0$, $s_B(i) = 0$. Then $\rho$-approx-A($i, s_A$) and $\rho$-approx-B($i, s_B$) have a longest common
subsequence of length $\rho$, since one is a decreasing sequence and the other is a median sequence.

Thus the LCS has length at most $\rho$ when $s_A$ and $s_B$ do not intersect. □
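
The per-block case analysis in this proof can be checked numerically. The sketch below is an illustration only: it uses $\rho = 3$ and the block $i = 1$, the helper names are hypothetical, and the LCS is computed offline by the textbook dynamic program.

```python
def lcs_length(a, b):
    """Textbook dynamic-programming LCS length, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def median_sequence(rho):
    """Median sequence on {1..rho^2}: rho consecutive groups, each group reversed."""
    return [j * rho + r for j in range(rho) for r in range(rho, 0, -1)]


rho = 3
increasing = list(range(1, rho * rho + 1))
decreasing = increasing[::-1]
median = median_sequence(rho)

case_1 = lcs_length(increasing, median)      # s_A(i) = 1, s_B(i) = 0
case_2 = lcs_length(decreasing, increasing)  # s_A(i) = 0, s_B(i) = 1
case_3 = lcs_length(decreasing, median)      # s_A(i) = 0, s_B(i) = 0
both_one = lcs_length(increasing, increasing)  # s_A(i) = s_B(i) = 1
```

For $\rho = 3$ the three non-intersecting cases give lengths 3, 1, and 3, against 9 when both bits are 1, which is the gap between $\rho$ and $\rho^2$ that the lemma relies on.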

Theorem 4.5 For any approximation ratio $\rho$, and for any $N$, any streaming algorithm which decides
whether (i) LCS($S_1, S_2$) $\ge \rho^2$ or (ii) LCS($S_1, S_2$) $\le \rho$ for streams $S_1, S_2$ which are permutations
of $\{1, \ldots, N\}$ with probability at least 3/4 requires $\Omega(N/\rho^2)$ space.

Proof. As in our previous lower-bound theorems, we can solve an instance of the Set-Disjointness
problem with $|s_A| = N/\rho^2 = |s_B|$ as follows. By Lemma 4.4, deciding whether the constructed
streams lcs-$\rho$-approx-perm-A($s_A$) and lcs-$\rho$-approx-perm-B($s_B$) have an LCS of length (i) at least
$\rho^2$ or (ii) at most $\rho$ corresponds to deciding whether $s_A$ and $s_B$ intersect. So a data stream
algorithm $\mathcal{A}$ can be used to solve the Set-Disjointness problem. The first party simulates $\mathcal{A}$ on
lcs-$\rho$-approx-perm-A($s_A$), then passes the state of the algorithm to the second party. The second
party finishes the simulation of $\mathcal{A}$ on lcs-$\rho$-approx-perm-B($s_B$). Again, by Proposition 3.2, this
implies that we need $\Omega(N/\rho^2)$ space for this LCS decision procedure. □

Corollary 4.6 To $\rho$-approximate the LCS of $N$-element permutations, we need $\Omega(N/\rho^2)$ space. □

5 Conclusion and Future Work 

A classic theorem of Erdős and Szekeres follows from an elegant application of the pigeonhole
principle: for any sequence $S$ of $n + 1$ numbers, there is either an increasing subsequence of $S$ of
length $\sqrt{n}$ or a decreasing subsequence of $S$ of length $\sqrt{n}$ [10]. One of our original motivations for
looking at the LIS problem was to consider the difficulty of deciding, given a stream $S$, whether
(1) the length of the LIS of $S$ is at least $\sqrt{|S|}$, (2) the length of the longest decreasing subsequence is
at least $\sqrt{|S|}$, or (3) both. To do this, one needs an exact streaming algorithm for LIS; a minor
modification to the median sequence in Section 4 shows that one can have an LIS of length $\sqrt{n}$ or
length $\sqrt{n} - 1$ with a longest decreasing subsequence of length $\sqrt{n}$ or length $\sqrt{n} + 1$, respectively.
Of course, in the streaming model one is usually interested in approximate algorithms using,
say, polylogarithmic space. Our lower bounds for LCS show that one needs a large amount of
space for any reasonable approximation. However, our lower bounds for the LIS problem say that a
streaming algorithm that distinguishes between an LIS of length $k$ and one of length $k + 1$ requires
$\Omega(k)$ space. It is an interesting open question whether one can use a small amount of space to
approximate LIS in the streaming model.